This dataset contains 119390 observations for a City Hotel and a Resort Hotel. Each observation represents a hotel booking between the 1st of July 2015 and 31st of August 2017, including booking that effectively arrived and booking that were canceled.
Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.Four Columns, 'name', 'email', 'phone number' and 'credit_card' have been artificially created and added to the dataset.
The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
It would be nice for the hotels to have a model to predict if a guest will actually come.
This can help a hotel to plan things like personel and food requirements.
Maybe some hotels also use such a model to offer more rooms than they have to make more money... who knows...
This project is a general project to explore and analyze this data without having a major problem to solve, so you will find that I worked on most of the variables and the relationships between them to explore and analyze most of the data
# import the libraries that i will be using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.express as px
%matplotlib inline
# import data by ``` pd.read_csv ``` function
df = pd.read_csv('hotel_bookings.csv')
# read count of columns and rows
df.shape
(119390, 32)
# read columns and data type
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119390 entries, 0 to 119389 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119390 non-null object 1 is_canceled 119390 non-null int64 2 lead_time 119390 non-null int64 3 arrival_date_year 119390 non-null int64 4 arrival_date_month 119390 non-null object 5 arrival_date_week_number 119390 non-null int64 6 arrival_date_day_of_month 119390 non-null int64 7 stays_in_weekend_nights 119390 non-null int64 8 stays_in_week_nights 119390 non-null int64 9 adults 119390 non-null int64 10 children 119386 non-null float64 11 babies 119390 non-null int64 12 meal 119390 non-null object 13 country 118902 non-null object 14 market_segment 119390 non-null object 15 distribution_channel 119390 non-null object 16 is_repeated_guest 119390 non-null int64 17 previous_cancellations 119390 non-null int64 18 previous_bookings_not_canceled 119390 non-null int64 19 reserved_room_type 119390 non-null object 20 assigned_room_type 119390 non-null object 21 booking_changes 119390 non-null int64 22 deposit_type 119390 non-null object 23 agent 103050 non-null float64 24 company 6797 non-null float64 25 days_in_waiting_list 119390 non-null int64 26 customer_type 119390 non-null object 27 adr 119390 non-null float64 28 required_car_parking_spaces 119390 non-null int64 29 total_of_special_requests 119390 non-null int64 30 reservation_status 119390 non-null object 31 reservation_status_date 119390 non-null object dtypes: float64(4), int64(16), object(12) memory usage: 29.1+ MB
# see head of data
df.head(10)
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 7/3/2015 |
| 5 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 7/3/2015 |
| 6 | Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 7/3/2015 |
| 7 | Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 303.0 | NaN | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 7/3/2015 |
| 8 | Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 82.0 | 0 | 1 | Canceled | 5/6/2015 |
| 9 | Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 15.0 | NaN | 0 | Transient | 105.5 | 0 | 0 | Canceled | 4/22/2015 |
10 rows × 32 columns
# see mean of Average Daily Rate (ADR)
df.adr.mean()
101.83112153446453
# see the unique value of arrival_date_day_of_month
df.arrival_date_day_of_month.unique()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
dtype=int64)
# see the unique value of stays_in_weekend_nights
df.stays_in_weekend_nights.unique()
array([ 0, 1, 2, 4, 3, 6, 13, 8, 5, 7, 12, 9, 16, 18, 19, 10, 14],
dtype=int64)
# see the value of hotel
df.hotel.value_counts()
City Hotel 79330 Resort Hotel 40060 Name: hotel, dtype: int64
# see some summarize basic statistical details
df.describe()
| is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 103050.000000 | 6797.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
| mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 86.693382 | 189.266735 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
| std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 110.774548 | 131.655015 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
| min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 62.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 179.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
# Are there a null value ?
df.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 arrival_date_year 0 arrival_date_month 0 arrival_date_week_number 0 arrival_date_day_of_month 0 stays_in_weekend_nights 0 stays_in_week_nights 0 adults 0 children 4 babies 0 meal 0 country 488 market_segment 0 distribution_channel 0 is_repeated_guest 0 previous_cancellations 0 previous_bookings_not_canceled 0 reserved_room_type 0 assigned_room_type 0 booking_changes 0 deposit_type 0 agent 16340 company 112593 days_in_waiting_list 0 customer_type 0 adr 0 required_car_parking_spaces 0 total_of_special_requests 0 reservation_status 0 reservation_status_date 0 dtype: int64
# Are there irrational values?
filter =df.query('children == 0 & babies == 0 & adults == 0 ')
filter
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2224 | Resort Hotel | 0 | 1 | 2015 | October | 41 | 6 | 0 | 3 | 0 | ... | No Deposit | NaN | 174.0 | 0 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 10/6/2015 |
| 2409 | Resort Hotel | 0 | 0 | 2015 | October | 42 | 12 | 0 | 0 | 0 | ... | No Deposit | NaN | 174.0 | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 10/12/2015 |
| 3181 | Resort Hotel | 0 | 36 | 2015 | November | 47 | 20 | 1 | 2 | 0 | ... | No Deposit | 38.0 | NaN | 0 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 11/23/2015 |
| 3684 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 1 | 4 | 0 | ... | No Deposit | 308.0 | NaN | 122 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 1/4/2016 |
| 3708 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 2 | 4 | 0 | ... | No Deposit | 308.0 | NaN | 122 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 1/5/2016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115029 | City Hotel | 0 | 107 | 2017 | June | 26 | 27 | 0 | 3 | 0 | ... | No Deposit | 7.0 | NaN | 0 | Transient | 100.80 | 0 | 0 | Check-Out | 6/30/2017 |
| 115091 | City Hotel | 0 | 1 | 2017 | June | 26 | 30 | 0 | 1 | 0 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 1 | 1 | Check-Out | 7/1/2017 |
| 116251 | City Hotel | 0 | 44 | 2017 | July | 28 | 15 | 1 | 1 | 0 | ... | No Deposit | 425.0 | NaN | 0 | Transient | 73.80 | 0 | 0 | Check-Out | 7/17/2017 |
| 116534 | City Hotel | 0 | 2 | 2017 | July | 28 | 15 | 2 | 5 | 0 | ... | No Deposit | 9.0 | NaN | 0 | Transient-Party | 22.86 | 0 | 1 | Check-Out | 7/22/2017 |
| 117087 | City Hotel | 0 | 170 | 2017 | July | 30 | 27 | 0 | 2 | 0 | ... | No Deposit | 52.0 | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 7/29/2017 |
180 rows × 32 columns
filter.shape
(180, 32)
# drop the rows that don't have value on guests
df = df.drop(filter.index)
# drop the null value
df.fillna(0 , inplace =True)
df.query('children == 0 & babies == 0 & adults == 0 ')
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|
0 rows × 32 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 119210 entries, 0 to 119389 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119210 non-null object 1 is_canceled 119210 non-null int64 2 lead_time 119210 non-null int64 3 arrival_date_year 119210 non-null int64 4 arrival_date_month 119210 non-null object 5 arrival_date_week_number 119210 non-null int64 6 arrival_date_day_of_month 119210 non-null int64 7 stays_in_weekend_nights 119210 non-null int64 8 stays_in_week_nights 119210 non-null int64 9 adults 119210 non-null int64 10 children 119210 non-null float64 11 babies 119210 non-null int64 12 meal 119210 non-null object 13 country 119210 non-null object 14 market_segment 119210 non-null object 15 distribution_channel 119210 non-null object 16 is_repeated_guest 119210 non-null int64 17 previous_cancellations 119210 non-null int64 18 previous_bookings_not_canceled 119210 non-null int64 19 reserved_room_type 119210 non-null object 20 assigned_room_type 119210 non-null object 21 booking_changes 119210 non-null int64 22 deposit_type 119210 non-null object 23 agent 119210 non-null float64 24 company 119210 non-null float64 25 days_in_waiting_list 119210 non-null int64 26 customer_type 119210 non-null object 27 adr 119210 non-null float64 28 required_car_parking_spaces 119210 non-null int64 29 total_of_special_requests 119210 non-null int64 30 reservation_status 119210 non-null object 31 reservation_status_date 119210 non-null object dtypes: float64(4), int64(16), object(12) memory usage: 30.0+ MB
Here i will exploring the data by answer some question like :
questions¶
- Is there relationship between having babies and the rate of cancellation ? (Statistically)
- Is cancellation of reservations highest in the resort about the city hotel? (Statistically)
- what the kind of hotel that have highest reservations?
- Wich year has the highest number of bookings ?
- Distribution of staye in week nights on count of days
- Distribution of staye in weekend nights on count of days
- DISTRIBUTION OF count leaad time for reservations
- What is the most kind of Board on Reservations ?
- What are the countries with the highest bookings?
df.head(3)
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
3 rows × 32 columns
df.is_canceled.value_counts()
0 75011 1 44199 Name: is_canceled, dtype: int64
df.is_canceled.mean()
0.370765875346028
no_baby =df.query('babies == 0')
no_baby.is_canceled.mean()
0.37222828062522717
one_baby =df.query('babies == 1')
one_baby.is_canceled.mean()
0.18333333333333332
one_baby =df.query('babies == 2')
one_baby.is_canceled.mean()
0.13333333333333333
df.babies.value_counts()
0 118293 1 900 2 15 10 1 9 1 Name: babies, dtype: int64
# the unique value of hotel
df.hotel.unique()
array(['Resort Hotel', 'City Hotel'], dtype=object)
# mean cancellation of reservations for [ Resort Hotel ]
df.query('hotel == "Resort Hotel"').is_canceled.mean()
0.27767373336329815
# mean cancellation of reservations for [ city Hotel ]
df.query('hotel == "City Hotel"').is_canceled.mean()
0.4178593534858457
# mean cancellation of reservations for all data
df.is_canceled.mean()
0.370765875346028
df.is_canceled.value_counts()
0 75011 1 44199 Name: is_canceled, dtype: int64
# create countplot by seaborn to see destributions of hotels
fig = plt.subplots( figsize = [10,5])
default_color = sb.color_palette()[0]
sb.countplot(data=df ,x ='hotel' , color=default_color)
plt.title('count of reservvations on every hotel')
Text(0.5, 1.0, 'count of reservvations on every hotel')
# creae a bar plot for count of reservations of years
fig = plt.subplots( figsize = [10,5])
sb.countplot(data = df , x='arrival_date_year')
plt.title('count of reservations for every year')
Text(0.5, 1.0, 'count of reservations for every year')
df.reservation_status_date.min() , df.reservation_status_date.max()
('1/1/2015', '9/9/2017')
fig = plt.subplots( figsize = [11,6])
default_color = sb.color_palette()[0]
sb.countplot(data = df , x='arrival_date_month' , color =default_color)
plt.title('Distribution of reservations throughout the months of the year')
plt.xticks(rotation =70);
fig, ax = plt.subplots( figsize = [15,6])
default_color = sb.color_palette()[0]
sb.countplot(data=df ,x='stays_in_week_nights',color=default_color );
plt.title('Distribution of staye in week nights')
Text(0.5, 1.0, 'Distribution of staye in week nights')
df.query('stays_in_week_nights > 11');
ax = plt.subplots( figsize = [12,6])
sb.countplot(data=df ,x='stays_in_weekend_nights',color=default_color )
<AxesSubplot:xlabel='stays_in_weekend_nights', ylabel='count'>
Booking Lead Time* is the period of time between when a guest makes a reservation, and the actual check-in date. If a guest makes a reservation on March 1 and the check-in date is Mar 30, then the booking lead time for that reservation is 30
plt.figure(figsize=[9,6])
bins = np.arange(0,df.lead_time.max() ,25)
plt.hist(data = df , x='lead_time' , bins = bins );
plt.xlabel('lead_time')
plt.ylabel('frequency of lead time ')
plt.title('DISTRIBUTION OF count leaad time for reservations')
Text(0.5, 1.0, 'DISTRIBUTION OF count leaad time for reservations')
SC (Self Catering) : No meals are included; however, your accommodation will be provided with catering facilities for you to cook light meals.
BB (Bed and Breakfast) : Breakfast is included.
HB (Half Board) : Breakfast and evening meals are included. In some cases, you can choose to receive lunch instead of breakfast – the hotel will confirm this on arrival.
FB (Full Board) : Breakfast, lunch and evening meals are included.
df.meal.value_counts().plot(kind ='bar', figsize =(8,5))
plt.xlabel('kind of board meals')
plt.ylabel('count of meal')
plt.title(' the most kind of Board on Reservations')
Text(0.5, 1.0, ' the most kind of Board on Reservations')
df.country.value_counts()[:10].plot(kind = 'bar' ,figsize=(12,6))
plt.xlabel('Country')
plt.ylabel('count of bookings')
plt.title(' the countries with the highest bookings')
Text(0.5, 1.0, ' the countries with the highest bookings')
df.groupby('arrival_date_year')['lead_time'].mean().plot(kind='bar' , figsize=(8,4))
plt.xlabel('Years')
plt.ylabel('avg_cancellation')
plt.title('The difference in the average cancellation of reservations between years ')
Text(0.5, 1.0, 'The difference in the average cancellation of reservations between years ')
x=df.groupby('arrival_date_month')['adr'].mean()
x.plot(kind='bar' , figsize=(10,5) )
<AxesSubplot:xlabel='arrival_date_month'>
plt.figure(figsize=(8,8)) sb.heatmap(df.corr(),annot=True,cmap='RdBu_r') plt.title('the correlation between the variabeles')
df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 0.0 | 0.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | 0.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | 0.0 | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 7/3/2015 |
5 rows × 32 columns
df.reserved_room_type.value_counts()
A 85873 D 19179 E 6519 F 2894 G 2092 B 1115 C 931 H 601 L 6 Name: reserved_room_type, dtype: int64
data=df.query('is_canceled == 0')
plt.figure(figsize=(12, 8))
sb.boxplot(data=data ,x='reserved_room_type',
y='adr' , hue='hotel')
plt.title("Price of room types per night and person", fontsize=16)
plt.xlabel("Room type", fontsize=16)
plt.ylabel("Price [EUR]", fontsize=16)
plt.legend(loc="upper right")
plt.ylim(0, 600)
plt.show()
resort_hotel = df.query('hotel == "Resort Hotel"').groupby('arrival_date_month')['adr'].mean().reset_index()
city_hotel = df.query('hotel == "City Hotel"').groupby('arrival_date_month')['adr'].mean().reset_index()
resort_hotel , city_hotel
( arrival_date_month adr
0 April 77.849496
1 August 186.790574
2 December 69.051887
3 February 55.189716
4 January 49.507033
5 July 155.181299
6 June 110.481032
7 March 57.554652
8 May 78.758134
9 November 48.313643
10 October 62.132572
11 September 93.252030,
arrival_date_month adr
0 April 111.397415
1 August 114.857330
2 December 89.209560
3 February 85.327519
4 January 82.754477
5 July 110.945950
6 June 119.186056
7 March 92.973339
8 May 121.764614
9 November 88.372486
10 October 100.119313
11 September 110.120296)
data_final = resort_hotel.merge(city_hotel , on ='arrival_date_month')
data_final.columns=['month','price_for_resort','price_for_city_hotel']
data_final
| month | price_for_resort | price_for_city_hotel | |
|---|---|---|---|
| 0 | April | 77.849496 | 111.397415 |
| 1 | August | 186.790574 | 114.857330 |
| 2 | December | 69.051887 | 89.209560 |
| 3 | February | 55.189716 | 85.327519 |
| 4 | January | 49.507033 | 82.754477 |
| 5 | July | 155.181299 | 110.945950 |
| 6 | June | 110.481032 | 119.186056 |
| 7 | March | 57.554652 | 92.973339 |
| 8 | May | 78.758134 | 121.764614 |
| 9 | November | 48.313643 | 88.372486 |
| 10 | October | 62.132572 | 100.119313 |
| 11 | September | 93.252030 | 110.120296 |
# sort columns of month by import sort_dataframeby_monthorweek
import sort_dataframeby_monthorweek as sd
data_final=sd.Sort_Dataframeby_Month(df=data_final,monthcolumnname='month')
data_final
| month | price_for_resort | price_for_city_hotel | |
|---|---|---|---|
| 0 | January | 49.507033 | 82.754477 |
| 1 | February | 55.189716 | 85.327519 |
| 2 | March | 57.554652 | 92.973339 |
| 3 | April | 77.849496 | 111.397415 |
| 4 | May | 78.758134 | 121.764614 |
| 5 | June | 110.481032 | 119.186056 |
| 6 | July | 155.181299 | 110.945950 |
| 7 | August | 186.790574 | 114.857330 |
| 8 | September | 93.252030 | 110.120296 |
| 9 | October | 62.132572 | 100.119313 |
| 10 | November | 48.313643 | 88.372486 |
| 11 | December | 69.051887 | 89.209560 |
px.line(data_final, x='month', y=['price_for_resort','price_for_city_hotel'], title='Room price per night over the Months')
# same plot by seaborn
plt.figure(figsize=(12, 6))
sb.lineplot(x = "month", y="price_for_resort", data=data_final,label='Resort')
sb.lineplot(x = "month", y="price_for_city_hotel", data=data_final,label='City_hotel')
plt.title("Room price per night and person over the year", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Price [EUR]", fontsize=16)
plt.legend()
plt.show()
# get number of acutal guests by country
country_data = pd.DataFrame(df[df['is_canceled'] == 0]['country'].value_counts())
country_data.rename(columns={'country' : 'Number_of_guests'} ,inplace=True)
total_guests = country_data.Number_of_guests.sum()
country_data['Guests_in_%'] = round(country_data.Number_of_guests / total_guests *100 ,2)
country_data["country"] = country_data.index
country_data.loc[country_data["Guests_in_%"] < 2, "country"] = "Other"
#pie_plot
fig=px.pie(country_data ,values='Number_of_guests' ,names = 'country' , title='Home country of guests' ,template='seaborn')
fig.update_traces(textposition="inside", textinfo="value+percent+label")
fig.show()
Portugal is the most booked country
the rate of cancellation of reservation increasees every year but at simple rates